Add initial end-to-end CUDA FGMRES solver path #2825
Move the FGMRES iteration into one shared implementation and select host or CUDA vector-operation backends from the existing solver entry point. Keep the CUDA path GPU-resident for SpMV, Jacobi, dot/norm, and vector updates, with only scalar reductions and final solution synchronization crossing back to the host.
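As a rough illustration of the "one shared implementation, two backends" idea, here is a minimal sketch of a single orthogonalization step written once against a backend policy. The names `HostOps`, `Dot`, `Axpy`, `Scale`, and `ModifiedGramSchmidt` are hypothetical and are not taken from this PR. A CUDA backend would supply the same three operations on device buffers, so only the scalar results would cross back to the host.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host backend: plain loops over std::vector<double>.
// A CUDA backend would expose the same static interface on device buffers.
struct HostOps {
  using Vector = std::vector<double>;

  static double Dot(const Vector& a, const Vector& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
  }
  // y += alpha * x
  static void Axpy(double alpha, const Vector& x, Vector& y) {
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += alpha * x[i];
  }
  // x *= alpha
  static void Scale(double alpha, Vector& x) {
    for (double& v : x) v *= alpha;
  }
};

// One modified Gram-Schmidt step, as used in the Arnoldi process of (F)GMRES:
// orthogonalize w against every vector in `basis`, normalize w, and return
// the projection coefficients followed by the norm of the remainder.
// Only scalars are returned; the vectors stay wherever the backend keeps them.
template <class Ops>
std::vector<double> ModifiedGramSchmidt(
    const std::vector<typename Ops::Vector>& basis, typename Ops::Vector& w) {
  std::vector<double> h;
  h.reserve(basis.size() + 1);
  for (const auto& v : basis) {
    const double hij = Ops::Dot(v, w);
    Ops::Axpy(-hij, v, w);
    h.push_back(hij);
  }
  const double norm = std::sqrt(Ops::Dot(w, w));
  if (norm > 0.0) Ops::Scale(1.0 / norm, w);
  h.push_back(norm);
  return h;
}
```

Calling `ModifiedGramSchmidt<HostOps>(basis, w)` selects the host loops at compile time. The PR selects the backend from the existing solver entry point, which this sketch does not attempt to reproduce.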
    namespace VecExpr {

    enum class DeviceAssignOp { Assign, Add, Subtract, Multiply, Divide };

    template <class Scalar>
    class CDeviceVectorView : public CVecExpr<CDeviceVectorView<Scalar>, Scalar> {
This is a step in the right direction, but what I have in mind is to use CSysVector directly, so that CSysSolve can stay nearly identical and completely agnostic to CPU or GPU.
I think we can do this by specializing the store_t trait for CSysVector, so that the type stored by expressions becomes this view. Note that we only need to capture the pointer; the size is defined by the vector on the left-hand side of the assignment or compound assignment.
This way CSysVector either does the CPU loop or launches the GPU kernel in the assignment, according to how the GPU boolean is set.
For clarity, CSysVector stops being stored as a reference and is instead stored "as a view" (which is a value type).
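To make the suggestion concrete, here is a minimal, self-contained sketch of an expression-template setup in which a storage trait makes the vector type appear inside expressions as a pointer-only view. All names (`Expr`, `VectorView`, `Vector`, `AddExpr`) are hypothetical stand-ins, and the `store_t` shown is a simplified imitation rather than the actual SU2 trait. Only the CPU loop is shown. A GPU branch in the assignment would launch a kernel over the same expression object instead.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// CRTP base for all expressions (hypothetical, simplified).
template <class Derived>
struct Expr {
  const Derived& derived() const { return static_cast<const Derived&>(*this); }
};

// Trait deciding how an operand is stored inside an expression node.
// Default: store a copy (expression nodes are small temporaries).
template <class T>
struct store_t { using type = T; };

class Vector;

// Value-type view capturing only the data pointer, not the size.
struct VectorView : Expr<VectorView> {
  const double* ptr;
  VectorView(const Vector& v);  // defined once Vector is complete
  double operator[](std::size_t i) const { return ptr[i]; }
};

// Specialization: a Vector operand is stored as a VectorView, not a reference.
template <>
struct store_t<Vector> { using type = VectorView; };

template <class L, class R>
struct AddExpr : Expr<AddExpr<L, R>> {
  typename store_t<L>::type lhs;
  typename store_t<R>::type rhs;
  AddExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
  double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

template <class L, class R>
AddExpr<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
  return AddExpr<L, R>(l.derived(), r.derived());
}

class Vector : public Expr<Vector> {
  std::vector<double> data_;

 public:
  explicit Vector(std::size_t n, double v = 0.0) : data_(n, v) {}
  const double* data() const { return data_.data(); }
  double operator[](std::size_t i) const { return data_[i]; }
  double& operator[](std::size_t i) { return data_[i]; }

  // The left-hand side defines the size; the expression only needs pointers.
  template <class E>
  Vector& operator=(const Expr<E>& e) {
    const E& expr = e.derived();
    for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = expr[i];
    return *this;
  }
};

inline VectorView::VectorView(const Vector& v) : ptr(v.data()) {}

// Pointer-only views keep the whole expression trivially copyable, a property
// one would typically want before passing it by value to a kernel.
static_assert(std::is_trivially_copyable<AddExpr<Vector, Vector>>::value,
              "expression should be a plain value type");
```

With this in place, `a = b + c + b;` builds a nested `AddExpr` holding three views and evaluates it in a single loop over the elements of `a`.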
Proposed Changes
This PR adds an initial end-to-end CUDA FGMRES linear solve path on top of the existing CUDA BSR SpMV path.
It intentionally bundles the minimal pieces required for a reviewable GPU linear-solve slice, rather than sending the intermediate infrastructure-only pieces separately. The scope is limited to one GPU Krylov solver path (FGMRES), one simple GPU preconditioner path (JACOBI), and the vector operations and dispatch/lifecycle changes strictly required to make that path run.
Concretely, this PR:
- cuSPARSE for SpMV
- cuBLAS for dot/norm
- CSysVector expression-template vector updates to custom CUDA kernels instead of exposing a parallel solver-visible GPU vector algebra API
This PR does not attempt to add more GPU Krylov solvers, more advanced GPU preconditioners, remove the current host-driven Krylov control flow, or perform broader cache / portability / cleanup work beyond this minimal slice.
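For readers less familiar with the preconditioner in scope, the following is a minimal host-side sketch of applying a block-Jacobi preconditioner, z = D^{-1} r. It assumes the diagonal blocks have already been inverted and are stored contiguously in row-major order. The function name and data layout are illustrative assumptions, not the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Apply z = D^{-1} r, where D is the block diagonal of the system matrix.
// invDiag holds one dense blockSize x blockSize block per block row,
// row-major and already inverted (assumed layout). z must not alias r.
inline void ApplyBlockJacobi(const std::vector<double>& invDiag,
                             std::size_t blockSize,
                             const std::vector<double>& r,
                             std::vector<double>& z) {
  const std::size_t nBlocks = r.size() / blockSize;
  z.assign(r.size(), 0.0);
  for (std::size_t b = 0; b < nBlocks; ++b) {
    const double* block = &invDiag[b * blockSize * blockSize];
    for (std::size_t i = 0; i < blockSize; ++i) {
      double sum = 0.0;
      for (std::size_t j = 0; j < blockSize; ++j) {
        sum += block[i * blockSize + j] * r[b * blockSize + j];
      }
      z[b * blockSize + i] = sum;
    }
  }
}
```

Each output entry depends only on its own block row, so the outer loops carry no dependencies. This is why the operation maps naturally onto one GPU thread per row or per block.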
Related Work
This PR follows the review direction discussed in #2822, where the request was to show a working end-to-end GPU linear solve path before splitting out additional infrastructure work.
It also follows the implementation preferences discussed in #2816:
- cuSPARSE for SpMV
- cuBLAS for dot/norm
Suggested review order:
1. 53bacf193f Cache CUDA SpMV cuSPARSE resources
2. 08fde80e1e Add CUDA FGMRES and Jacobi scaffolding
3. fde2c145cf Implement CUDA vector primitives
4. 2b4f9d8716 Implement CUDA FGMRES solve path
5. 9c344ee793 Implement CUDA Jacobi preconditioner
6. a875f56767 Share FGMRES control flow with CUDA vector dispatch
Validation
Validated locally with:
- python3.12 -m pre_commit run --all-files
- LINEAR_SOLVER_PREC=NONE and LINEAR_SOLVER_PREC=JACOBI
- nsys profiling
- ncu profiling
Representative cases used for validation:
- periodic2d_sector
- udf_lam_flatplate_s
- udf_lam_flatplate_m
- udf_lam_flatplate_l
- udf_test_11_probes_s
- udf_test_11_probes_m
In short: this branch compiles, the end-to-end CUDA FGMRES path runs successfully on the tested cases, and the GPU-side results are numerically consistent with the CPU-side results. Across the tested cases, the CPU and GPU residual histories either match exactly or differ only at floating-point roundoff level.
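One way a "roundoff-level" comparison of residual histories could be automated is sketched below. The helper names and the default tolerance of 1e-10 are assumptions chosen for illustration. They do not describe the check actually used for this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest relative difference between two residual histories.
// Returns infinity when the lengths differ (e.g. different iteration counts).
inline double MaxRelativeDifference(const std::vector<double>& cpu,
                                    const std::vector<double>& gpu) {
  if (cpu.size() != gpu.size()) {
    return std::numeric_limits<double>::infinity();
  }
  double worst = 0.0;
  for (std::size_t i = 0; i < cpu.size(); ++i) {
    const double scale = std::max(std::fabs(cpu[i]), std::fabs(gpu[i]));
    if (scale == 0.0) continue;  // both entries are exactly zero
    worst = std::max(worst, std::fabs(cpu[i] - gpu[i]) / scale);
  }
  return worst;
}

// True when every pair of entries agrees to within relTol (hypothetical default).
inline bool HistoriesAgree(const std::vector<double>& cpu,
                           const std::vector<double>& gpu,
                           double relTol = 1e-10) {
  return MaxRelativeDifference(cpu, gpu) <= relTol;
}
```

A relative measure is used because residuals typically span many orders of magnitude over a solve, so an absolute threshold would only be meaningful for the first few iterations.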
Performance was also checked on the same representative cases against both a serial CPU build and a 20-thread OpenMP CPU build. The GPU path is faster than the serial CPU baseline on the medium and large cases tested here. Against the 20-thread OpenMP CPU baseline, it is not beneficial on the smallest cases, but still shows a clear speedup on the medium and large cases tested here.
The simple Jacobi path is numerically valid, but is not yet a net performance win on these cases.
PR Checklist
pre-commit run --all to format old commits.